DS4PS


As an expert data scientist, you have been hired by the Mayor of Tempe to conduct a study and make recommendations on ways to reduce traffic injuries in the city. You will use the crash data from the city’s open data portal and your data wrangling skills to look for patterns that help explain the causes of traffic accidents in the city and that might suggest ways to reduce injuries and fatalities.

Consider the following questions:











Packages

You will use the following packages for this lab:


Data

We will use City of Tempe traffic accident data, which is available on the city’s Open Data Portal.



Working with Dates

So far we have worked with character, numeric, logical, and categorical (factor) vectors.

We need to introduce a new vector class for this lab, the date variable. Dates are complicated because they must function simultaneously as categorical variables (months of the year) and numeric variables capable of arithmetic (time that passes between two dates). Furthermore, we often want to convert between idiosyncratic date representations, such as a day of a specific month to a day of the week.

They are also complicated because they can be represented in many ways:

When you first read in a dataset, dates are typically loaded as character vectors:

## [1] "1/10/12 9:04 "  "1/5/12 17:24 "  "1/16/12 19:08 " "1/27/12 14:41 "
## [5] "1/10/12 13:41 " "1/9/12 17:49 "

We can convert dates stored as characters to a special date object by specifying the format using codes understood by the strptime() function:

## [1] "2012-01-10 09:04:00 MST" "2012-01-05 17:24:00 MST"
## [3] "2012-01-16 19:08:00 MST" "2012-01-27 14:41:00 MST"
## [5] "2012-01-10 13:41:00 MST" "2012-01-09 17:49:00 MST"
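The conversion can be sketched as follows. The raw strings are copied from the example output above. The object names and the tz = "America/Phoenix" setting are assumptions, chosen so that the result prints with the MST label.

```r
# Raw values as they appear when the file is first read (note the trailing space)
date.strings <- c("1/10/12 9:04 ", "1/5/12 17:24 ", "1/16/12 19:08 ",
                  "1/27/12 14:41 ", "1/10/12 13:41 ", "1/9/12 17:49 ")

# %m/%d/%y = month/day/two-digit year, %H:%M = hour and minute on a 24-hour clock
# tz is an assumption: Arizona stays on Mountain Standard Time all year
date.vec <- strptime(date.strings, format = "%m/%d/%y %H:%M", tz = "America/Phoenix")

class(date.vec)   # a date-time class, no longer "character"
date.vec
```

Characters left over after the format has been matched, such as the trailing space here, are ignored by strptime().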

Now R will recognize that the variable is a date, not a string, and it will be able to do complex day and time manipulations. Note that the format= argument of strptime() requires you to tell R what each value represents. In this case, %m represents month, %d represents day, and %y represents a two-digit year. In the original data they are separated by a forward slash, so that is included in the format argument.

We need to be explicit because dates can be stored as DD-MM-YYYY, MM-DD-YY, YYYY-MM-DD, or in any number of other formats. The format argument tells R how to read the date string.

We can now use the format() function to specify how we want the date represented using many common styles. Note that format() will return a character vector, not another date class.

## [1] "09" "17" "19" "14" "13" "17"
## [1] "09" "05" "07" "02" "01" "05"
## [1] "AM" "PM" "PM" "PM" "PM" "PM"
## [1] "01" "01" "01" "01" "01" "01"
## [1] "Jan" "Jan" "Jan" "Jan" "Jan" "Jan"
## [1] "Tuesday"  "Thursday" "Monday"   "Friday"   "Tuesday"  "Monday"
## [1] "Tue" "Thu" "Mon" "Fri" "Tue" "Mon"
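Calls along the following lines would produce output like that shown above. This is a sketch that rebuilds the six example dates first so that it runs on its own; the object names and time zone are assumptions, and the month and weekday names depend on your locale.

```r
date.strings <- c("1/10/12 9:04 ", "1/5/12 17:24 ", "1/16/12 19:08 ",
                  "1/27/12 14:41 ", "1/10/12 13:41 ", "1/9/12 17:49 ")
date.vec <- strptime(date.strings, format = "%m/%d/%y %H:%M", tz = "America/Phoenix")

format(date.vec, format = "%H")   # hour on a 24-hour clock
format(date.vec, format = "%I")   # hour on a 12-hour clock
format(date.vec, format = "%p")   # AM or PM
format(date.vec, format = "%m")   # month as a number
format(date.vec, format = "%b")   # abbreviated month name
format(date.vec, format = "%A")   # full weekday name
format(date.vec, format = "%a")   # abbreviated weekday name

# The result is always a character vector, not a date
class(format(date.vec, format = "%H"))
```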


Date Formatting Options

We can apply a wide range of formatting options to dates:

  • %a: Abbreviated weekday name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)

  • %A: Full weekday name in the current locale. (Also matches abbreviated name on input.)

  • %b: Abbreviated month name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)

  • %B: Full month name in the current locale. (Also matches abbreviated name on input.)

  • %c: Date and time. Locale-specific on output, “%a %b %e %H:%M:%S %Y” on input.

  • %C: Century (00–99): the integer part of the year divided by 100.

  • %d: Day of the month as decimal number 01–31.

  • %D: Date format such as %m/%d/%y: the C99 standard says it should be that exact format, but not all OS’s comply.

  • %e: Day of the month as decimal number 1–31, with a leading space for a single-digit number.

  • %F: Equivalent to %Y-%m-%d the ISO 8601 date format.

  • %g: The last two digits of the week-based year. Accepted but ignored on input.

  • %G: The week-based year as a decimal number. Accepted but ignored on input.

  • %h: Equivalent to %b.

  • %H: Hours as decimal number 00–23. As a special exception strings such as 24:00:00 are accepted for input, since ISO 8601 allows these.

  • %I: Hours as decimal number 01–12.

  • %j: Day of year as decimal number 001–366.

  • %m: Month as decimal number 01–12.

  • %M: Minute as decimal number 00–59.

  • %n: Newline on output, arbitrary whitespace on input.

  • %p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales (and the behaviour is undefined if used for input in such a locale). Some platforms accept %P for output, which uses a lower-case version: others will output P.

  • %r: The 12-hour clock time (using the locale’s AM or PM). Only defined in some locales.

  • %R: Equivalent to %H:%M.

  • %S: Second as integer 00–61, allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).

  • %t: Tab on output, arbitrary whitespace on input.

  • %T: Equivalent to %H:%M:%S.

  • %u: Weekday as a decimal number 1–7, Monday is 1.

  • %U: Week of the year as decimal number 00–53 using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.

  • %V: Week of the year as decimal number 01–53 as defined in ISO 8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise, it is the last week of the previous year, and the next week is week 1. (Accepted but ignored on input.)

  • %w: Weekday as decimal number 0–6, Sunday is 0.

  • %W: Week of the year as decimal number 00–53 using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.

  • %x: Date. Locale-specific on output, “%y/%m/%d” on input.

  • %X: Time. Locale-specific on output, “%H:%M:%S” on input.

  • %y: Year without century 00–99. On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.

  • %Y: Year with century. Note that whereas there was no zero in the original Gregorian calendar, ISO 8601:2004 defines it to be valid (interpreted as 1BC): see https://en.wikipedia.org/wiki/0_(year). Note that the standards also say that years before 1582 in its calendar should only be used with agreement of the parties involved. For input, only years 0:9999 are accepted.

  • %z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC. Values up to +1400 are accepted as from R 3.1.1: previous versions only accepted up to +1200. (Standard only for output.)

  • %Z: (Output only.) Time zone abbreviation as a character string (empty if not available). This may not be reliable when a time zone has changed abbreviations over the years.
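To make a few of the less familiar codes concrete, here is a small sketch that applies them to a single date. The date is the first one from the example above, and the object name is arbitrary.

```r
one.day <- as.Date("2012-01-10")   # a Tuesday

format(one.day, "%j")   # day of the year: "010"
format(one.day, "%u")   # weekday number with Monday as 1: "2"
format(one.day, "%U")   # week of the year, Sunday as first day: "02"
format(one.day, "%Y")   # four-digit year: "2012"
```

Because January 1, 2012 fell on a Sunday, the US week-of-year convention (%U) counts January 1 through 7 as week 01, which puts January 10 in week 02.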

Creating New Date Variables



We can be more precise by counting crashes per week within a given year, which is a more intuitive and actionable statistic than summing across all years in the dataset.
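One way to sketch this is to create year and week variables with format() and then cross-tabulate them. The toy data frame and the DateTime column name below are hypothetical stand-ins for the crash file, and the time zone is an assumption.

```r
# Toy data standing in for the crash file (column name is hypothetical)
dat <- data.frame(DateTime = c("1/10/12 9:04", "1/11/12 17:24", "1/16/12 19:08",
                               "1/8/13 14:41", "1/9/13 13:41", "1/20/13 17:49"),
                  stringsAsFactors = FALSE)

date.vec <- strptime(dat$DateTime, format = "%m/%d/%y %H:%M", tz = "America/Phoenix")

# New variables derived from the date
dat$year <- format(date.vec, "%Y")
dat$week <- format(date.vec, "%U")   # week of the year, Sunday as first day
dat$hour <- format(date.vec, "%H")
dat$day  <- format(date.vec, "%A")

# Crashes per week, kept separate for each year (rows = year, columns = week)
week.counts <- table(dat$year, dat$week)
week.counts
```

Tabulating week alone would pool week 02 of 2012 with week 02 of every other year. Adding year as a second dimension keeps them apart.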



Recode Factor Levels

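Some of the categorical variables are hard to work with because they have levels that are rare or hard to interpret. The first table below shows the original levels of Collisionmanner, and the second shows the simplified levels after recoding.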

Collisionmanner                                  n
10                                               3
Rear To Rear                                    56
Rear To Side                                   174
Sideswipe Opposite Direction                   189
Unknown                                        345
Head On                                        348
Other                                          971
Single Vehicle                                1737
Sideswipe Same Direction                      3565
ANGLE (Front To Side)(Other Than Left Turn)   4686
Left Turn                                     5395
Rear End                                     11001

Collisionmanner      n
Head On            348
Single Vehicle    1737
Lane Change       3565
Angle             4686
Left Turn         5395
Rear End         11001
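A recode like the one shown in the tables can be sketched with a named lookup vector in base R. The toy vector below is illustrative, and turning the unlisted levels into NA is an assumption about how the small categories were dropped. Functions such as dplyr's recode() are another option.

```r
# Toy vector standing in for the Collisionmanner column (values are illustrative)
manner <- c("Rear End", "Left Turn", "ANGLE (Front To Side)(Other Than Left Turn)",
            "Sideswipe Same Direction", "Unknown", "10", "Head On", "Single Vehicle")

# Lookup vector: names are the original labels, values are the new labels
new.labels <- c("ANGLE (Front To Side)(Other Than Left Turn)" = "Angle",
                "Sideswipe Same Direction" = "Lane Change",
                "Head On"        = "Head On",
                "Left Turn"      = "Left Turn",
                "Rear End"       = "Rear End",
                "Single Vehicle" = "Single Vehicle")

# Labels missing from the lookup (e.g. "Unknown", "10") become NA
recoded <- factor(unname(new.labels[manner]))

table(recoded, useNA = "ifany")
```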

Patterns in types of crashes by time of day:

Hour  Angle  Head On  Left Turn  Rear End  Lane Change  Single Vehicle
00       71        4         59        83           50              89
01       34        8         32        78           28              87
02       40        7         23        87           34             136
03       21        8         13        35            7              92
04       18        6         23        17            5              50
05       49        7         63        73           39              49
06      127        9        117       213           82              57
07      286       19        359       616          180              69
08      258       13        279       623          192              60
09      212       11        180       360          174              63
10      217        6        188       398          153              51
11      250       12        246       531          203              58
12      333       14        284       729          210              56
13      347       21        325       738          233              54
14      353       19        334       764          295              64
15      379       24        420       998          303              79
16      397       23        582      1200          323              87
17      427       36        619      1285          337              90
18      289       22        421       811          236              76
19      186       19        249       437          143              63
20      135       26        182       316          111              83
21      114        9        177       294           98              74
22       76       13        130       192           76              66
23       67       12         90       123           53              84
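A two-way table of this kind can be built with table(). The sketch below uses a small made-up data frame with hypothetical column names. It also shows row proportions, which make hours with very different crash volumes easier to compare.

```r
# Toy data: hour of day (as formatted by "%H") and recoded collision type
crashes <- data.frame(
  hour = c("07", "07", "17", "17", "17", "02"),
  Collisionmanner = c("Rear End", "Left Turn", "Rear End",
                      "Rear End", "Angle", "Single Vehicle"),
  stringsAsFactors = FALSE)

# Counts: one row per hour, one column per crash type
tab <- table(crashes$hour, crashes$Collisionmanner)
tab

# Share of each crash type within an hour (each row sums to 1)
round(prop.table(tab, margin = 1), 2)
```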

Age of Driver

We have a wide range of driver ages:

Min.  1st Qu.  Median   Mean  3rd Qu.  Max.  NA's
   2       22      31  43.64       51   255   360

Working with this many distinct ages would complicate our analysis, so it is better to convert the numeric age variable into a categorical age-group variable. We will use the cut() function for this, which accepts a numeric variable and a set of group cut points (the breaks= argument), then returns the proper group label for each age.

These group labels are a little awkward, so let’s improve them a bit by creating our own:
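Both steps can be sketched on a handful of made-up ages. The break points and label text here are assumptions chosen for illustration, not the values used in the lab.

```r
ages <- c(16, 19, 25, 31, 47, 68, 82, NA)   # illustrative ages, not from the dataset

# Cut points are assumptions; each interval excludes its lower bound
# and includes its upper bound, e.g. (17,25]
age.breaks <- c(0, 17, 25, 35, 45, 55, 65, 75, 100)

# Default labels use interval notation
age.default <- cut(ages, breaks = age.breaks)
levels(age.default)

# Custom labels: supply one label per interval (length(breaks) - 1 in total)
age.labels <- c("Under 18", "18-25", "26-35", "36-45",
                "46-55", "56-65", "66-75", "Over 75")
age.group <- cut(ages, breaks = age.breaks, labels = age.labels)

table(age.group, useNA = "ifany")
```

Any value outside the range of the breaks is returned as NA, so with these cut points an age above 100 would be treated as missing.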

We can now analyze some trends by age group.



Lab Instructions

In this lab you will practice your logical statements, data verbs (dplyr functions), and recipes to conduct analysis looking for types of accidents that cause serious injury. You will need to pay attention to the difference between counts of events, and severity of events. We will define “harm” as any accident that causes at least one injury or fatality.
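As a starting point, the definition of harm translates directly into a logical statement. The sketch below uses a toy data frame, and the column names injuries and fatalities are hypothetical, so check the actual variable names in the crash data.

```r
# Toy data; column names are hypothetical stand-ins
dat <- data.frame(injuries   = c(0, 2, 0, 1, 0),
                  fatalities = c(0, 0, 1, 0, 0))

# TRUE when an accident caused at least one injury or fatality
harm <- dat$injuries > 0 | dat$fatalities > 0

sum(harm)    # count of accidents that caused harm
mean(harm)   # proportion of accidents that caused harm
```

Because TRUE counts as 1 and FALSE as 0, sum() gives a count of events and mean() gives a rate, which is the distinction between volume and severity that the lab asks you to keep in mind.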

You can create a new RMarkdown file, or download the lab template: RMD Template



PART 1: Summary Stats

Practice writing logical statements and basic data recipes for the following:

1) How many accidents happen on Mondays?

2) What proportion of accidents each week occur on Monday?

3) What proportion of accidents on Mondays result in harm?

4) What is the most typical type of accident (Collisionmanner) that occurs on Mondays?



PART 2: Rates of Harm

As a public health expert specializing in traffic accidents, you need to think about how to best target traffic accidents to reduce harm. Should we focus on the volume of traffic accidents, or the types of accidents that are most likely to cause harm?

Calculate each of the four descriptive statistics above as a function of the 24 hours of the day, and either print a table with times and counts/rates, or plot a graph of the statistics as a function of time similar to the examples above.



PART 3: Most Dangerous Accidents

Using at most two variables in the dataset to define your groups, identify the following:

1) The most dangerous accident to be involved in (highest rate of harm).

2) The type of accident that hurts the most citizens.

For example, it could be teenagers (group 1: age) who rear-end another driver (group 2: collision type), or drunk drivers (group 1: alcohol) who hit pedestrians (group 2: driver type), or men (group 1: gender) on Labor Day (group 2: date).

You can use any variables from the dataset, but you are limited to groups constructed from two variables. Report your findings. There will be a prize for the individual who finds the most harmful types of accidents.



Submission Instructions

After you have completed your lab, knit your RMD file. Log in to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your RMD and your HTML files to the appropriate lab submission link.

Remember to: